An in-depth walkthrough of the essential metrics used to evaluate machine learning model performance, with mathematical formulas and Python examples.
The cornerstone of all classification metrics: a table that summarizes a model's correct and incorrect predictions.
A Confusion Matrix is a table used to visualize the performance of a classification model. For binary classification it is a 2×2 matrix with four key components:
| | Predicted Positive (+) | Predicted Negative (−) |
|---|---|---|
| **Actual Positive (+)** | TP (True Positive) | FN (False Negative) |
| **Actual Negative (−)** | FP (False Positive) | TN (True Negative) |
**True Positive (TP):** Actually positive and correctly predicted as positive. E.g., a sick patient correctly diagnosed as sick.
**False Positive (FP):** Actually negative but incorrectly predicted as positive. Also known as a Type I Error. E.g., a healthy person flagged as sick.
**False Negative (FN):** Actually positive but incorrectly predicted as negative. Also known as a Type II Error. E.g., a sick patient missed by the test.
**True Negative (TN):** Actually negative and correctly predicted as negative. E.g., a healthy person confirmed as healthy.
Consider an email spam filter classifying 1,000 emails:
TP = 80 (spam correctly caught) · FP = 10 (legitimate emails marked as spam)
FN = 20 (spam emails missed) · TN = 890 (legitimate emails correctly passed)
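These four counts are enough to derive every metric discussed below. A minimal sketch that recomputes them directly from the example (the variable names are ours):

```python
# Spam-filter example: 1,000 emails, counts from the confusion matrix above.
TP, FP, FN, TN = 80, 10, 20, 890

accuracy    = (TP + TN) / (TP + TN + FP + FN)        # fraction correct overall
precision   = TP / (TP + FP)                         # how trustworthy "spam" flags are
recall      = TP / (TP + FN)                         # how much actual spam was caught
specificity = TN / (TN + FP)                         # how much legitimate mail passed
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R

print(f"Accuracy:    {accuracy:.3f}")    # 0.970
print(f"Precision:   {precision:.3f}")   # 0.889
print(f"Recall:      {recall:.3f}")      # 0.800
print(f"Specificity: {specificity:.3f}") # 0.989
print(f"F1 Score:    {f1:.3f}")          # 0.842
```

These values match the worked calculations in each section that follows.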
The most intuitive metric, but one that can be misleading when classes are imbalanced.
Accuracy is the ratio of correct predictions (both positive and negative) to the total number of predictions: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Spam filter: $\text{Accuracy} = \frac{80 + 890}{80 + 890 + 10 + 20} = \frac{970}{1000} = \mathbf{0.97}$ → 97%
Accuracy can be misleading with class imbalance. For example, if only 10 out of 1,000 patients have cancer, a model that never predicts cancer still achieves 99% accuracy, yet fails to detect a single case! Use Precision, Recall, and F1 Score for imbalanced data.
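The cancer-screening pitfall can be reproduced in a few lines (a sketch with made-up labels matching the example's proportions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 patients, only 10 with cancer (label 1); the "model" never predicts cancer.
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.99, looks excellent
print("Recall  :", recall_score(y_true, y_pred))    # 0.0, detects no cases at all
```

A 99% accuracy and a 0% recall describe the same useless model.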
Measures how trustworthy the model's positive predictions are.
Precision is the fraction of positive predictions that are actually correct: $\text{Precision} = \frac{TP}{TP + FP}$. It answers: "Of everything I labeled positive, how much was truly positive?"
Spam filter: $\text{Precision} = \frac{80}{80 + 10} = \frac{80}{90} = \mathbf{0.889}$ → 88.9%
Out of 90 emails flagged as spam, 80 were actually spam.
Focus on Precision when false positives are costly:
• Spam Filter: Blocking an important email is risky
• Search Engine: Irrelevant results degrade user experience
• Recommendation System: Wrong recommendations erode trust
Measures how many of the actual positive cases the model successfully captures.
Recall (also called Sensitivity or True Positive Rate) is the fraction of actual positives that the model correctly identified: $\text{Recall} = \frac{TP}{TP + FN}$. It answers: "Of all actual positives, how many did I catch?"
Spam filter: $\text{Recall} = \frac{80}{80 + 20} = \frac{80}{100} = \mathbf{0.80}$ → 80%
Out of 100 actual spam emails, our model caught 80 and missed 20.
Focus on Recall when false negatives are costly:
• Cancer Screening: Missing a patient can be life-threatening
• Fraud Detection: Missed fraud leads to major financial loss
• Security Systems: Undetected threats pose serious risks
Precision and Recall typically move in opposite directions. Lowering the decision threshold captures more positives (higher Recall) but also produces more false positives (lower Precision). The F1 Score balances this trade-off.
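The trade-off is easy to see by sweeping the threshold over a small set of predicted probabilities (the labels and scores below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # made-up ground truth
y_prob = np.array([0.9, 0.4, 0.8, 0.35, 0.1,
                   0.7, 0.6, 0.2, 0.55, 0.3])        # made-up model scores

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)       # classify by threshold
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

On this data, raising the threshold from 0.3 to 0.7 lifts Precision from 0.625 to 1.000 while Recall falls from 1.000 to 0.600.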
A single number that balances Precision and Recall using the harmonic mean.
The F1 Score is the harmonic mean of Precision and Recall: $F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$. The harmonic mean is used instead of the arithmetic mean because it penalizes extreme differences between the two values more heavily.
Spam filter: $F_1 = 2 \times \frac{0.889 \times 0.80}{0.889 + 0.80} = 2 \times \frac{0.711}{1.689} = \mathbf{0.842}$ → 84.2%
When you want to weight Precision or Recall differently, use the F-Beta Score:
| Score | β Value | Weight | Use Case |
|---|---|---|---|
| F0.5 | 0.5 | Favors Precision | Spam filters, search engines |
| F1 | 1.0 | Equal weight | General classification problems |
| F2 | 2.0 | Favors Recall | Medical diagnosis, security systems |
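Using the spam filter's Precision (80/90) and Recall (80/100), the general formula $F_\beta = (1 + \beta^2) \frac{P \times R}{\beta^2 P + R}$ shows how β shifts the balance (a small sketch; the helper function is ours):

```python
# Spam-filter values: Precision = 80/90 ≈ 0.889, Recall = 80/100 = 0.80.
P, R = 80 / 90, 80 / 100

def f_beta(p, r, beta):
    """General F-beta score: beta > 1 weights recall, beta < 1 weights precision."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g}: {f_beta(P, R, beta):.3f}")
# Since Precision > Recall here, the scores order as F0.5 > F1 > F2.
```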
Measures how well the model identifies true negatives.
Specificity (True Negative Rate) is the proportion of actual negatives correctly identified by the model: $\text{Specificity} = \frac{TN}{TN + FP}$. It is the counterpart of Recall for the negative class.
Spam filter: $\text{Specificity} = \frac{890}{890 + 10} = \frac{890}{900} = \mathbf{0.989}$ → 98.9%
98.9% of legitimate emails were correctly identified.
Sensitivity (Recall): How well do we detect the sick?
Specificity: How well do we identify the healthy?
Together they form the basis of the ROC curve.
A powerful visualization and comparison tool that shows model performance across all threshold values.
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (Recall) against the False Positive Rate (1 β Specificity) at various classification thresholds.
AUC is the area under the ROC curve. It ranges from 0 to 1 and measures the model's overall ability to distinguish between classes.
| AUC Value | Rating | Interpretation |
|---|---|---|
| 1.0 | Perfect | Model perfectly separates all classes |
| 0.9 – 1.0 | Excellent | High discriminative power |
| 0.7 – 0.9 | Good | Acceptable performance |
| 0.5 – 0.7 | Poor | Slightly better than random |
| 0.5 | Random | Equivalent to a coin flip |
AUC is threshold-independent and ideal for comparing models. It is more reliable than Accuracy for imbalanced datasets.
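A quick AUC computation on hypothetical scores (labels and scores are invented; `roc_curve` returns the FPR/TPR pairs that trace the curve):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # made-up labels
y_score = np.array([0.9, 0.4, 0.8, 0.35, 0.1,
                    0.7, 0.6, 0.2, 0.55, 0.3])        # made-up scores

print("AUC:", roc_auc_score(y_true, y_score))         # 0.88 for this data

# The points of the ROC curve itself, one per threshold:
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
```

An AUC of 0.88 means that for 88% of (positive, negative) pairs, the positive sample receives the higher score.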
Evaluates the quality of predicted probabilities: not just right or wrong, but how confident the model is.
Log Loss (Binary Cross-Entropy) measures how well the model's predicted probabilities match the true labels. Lower Log Loss = better model.
$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$
Where $y_i$ is the actual label (0 or 1), $\hat{y}_i$ is the predicted probability, and $N$ is the total number of samples.
• When the model's confidence level matters, not just the class label
• When performing probability calibration
• Frequently used as a scoring metric in Kaggle competitions
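The formula can be checked directly against scikit-learn's implementation (the probabilities below are invented):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])            # made-up labels
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1])  # made-up predicted probabilities

# Binary cross-entropy computed directly from the formula.
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print("Manual :", manual)
print("sklearn:", log_loss(y_true, y_prob))   # matches up to numerical clipping
```

Note the confident wrong-ish prediction (true label 1, probability 0.4) contributes far more loss than the confident correct ones.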
Error and goodness-of-fit measures for models that predict continuous values.
**MAE (Mean Absolute Error):** The average of absolute differences between predicted and actual values, $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$. More robust to outliers than MSE.
**MSE (Mean Squared Error):** The average of squared differences between predicted and actual values, $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. Penalizes larger errors more heavily.
**RMSE (Root Mean Squared Error):** The square root of MSE, $\text{RMSE} = \sqrt{\text{MSE}}$, bringing the error back to the original scale. One of the most widely used regression metrics.
**R² (Coefficient of Determination):** Indicates how much of the variance in the dependent variable is explained by the model, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. Closer to 1 = better fit.
| R² Value | Interpretation |
|---|---|
| 1.0 | Perfect fit: model explains all variance |
| 0.7 – 1.0 | Good fit |
| 0.4 – 0.7 | Moderate fit |
| < 0.4 | Weak fit: model doesn't explain data well |
| < 0 | Model is worse than predicting the mean |
Selecting the right metric is as important as selecting the right model.
| Scenario | Recommended Metric | Why? |
|---|---|---|
| Balanced dataset | Accuracy, F1 | Accuracy is reliable when classes are balanced |
| Imbalanced dataset | F1, AUC, Precision/Recall | Accuracy can be misleading |
| False positives are costly | Precision | Minimize FP |
| False negatives are costly | Recall | Minimize FN |
| Probability estimates matter | Log Loss, AUC | Evaluates model confidence |
| Model comparison | AUC | Threshold-independent comparison |
| Continuous value prediction | RMSE, MAE, R² | Designed for regression problems |
Computing all metrics with scikit-learn.
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report,
    roc_auc_score, log_loss
)
import numpy as np

# Ground truth and predicted labels
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Compute the core metrics
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))

# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_true, y_pred))

# Detailed per-class report
print("\nClassification Report:")
print(classification_report(y_true, y_pred))
```
```python
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score
)
import numpy as np

# Ground truth and predicted continuous values
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.2, 2.1, 6.8, 4.9])

print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mean_squared_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R²  :", r2_score(y_true, y_pred))
```